Nonprofits and Data

This group is for those interested in learning and sharing about all things data-related for nonprofits. The Nonprofits and Data group is for people using data to serve a mission, either directly or by improving nonprofits and the nonprofit sector. That includes everything from collecting data and managing databases to analytics, data visualization and data mining. Here are some examples of topics we discuss: using data to improve organizational effectiveness, measuring impact, using data for storytelling, tools for data management and analysis, figuring out the “right” data to collect, and learning skills to help us use data better.

Tool for large email list (1 million+) deduplication

  • 1.  Tool for large email list (1 million+) deduplication

    Posted 15 days ago
    Hello All,

    Does anyone have a recommendation for a tool to dedupe a large text file of email addresses? I've found free dedupe services online like List Scrubber but am hesitant to paste our entire email list on a site when I don't know where our data will end up. Any of the paid data cleansing products I've found have way more functionality than I need. I'm looking for a simple, secure and cheap solution for $25 or less.

    Many thanks!

    ------------------------------
    Denise Cummings
    Data Systems Administrator
    Friends of the Earth
    Washington, DC
    ------------------------------


  • 2.  RE: Tool for large email list (1 million+) deduplication

    Posted 14 days ago
    Edited by Dave Boyce 14 days ago
    Hello Denise,

    Microsoft Excel has a Remove Duplicates function:

    1. Paste the emails into Excel, or open the CSV file in Excel.
    2. Select 'Remove Duplicates' from the Data menu, or find it by typing in the Excel search box.
    3. Specify the email column.

    DAVE BOYCE | Data Solutions Consultant
    501 Commons - Seattle, WA

    ------------------------------
    Dave Boyce
    ------------------------------



  • 3.  RE: Tool for large email list (1 million+) deduplication

    Posted 12 days ago

    Thanks, Dave. Unfortunately, my file is too large for Excel. It's actually over 2 million emails. I think I've found a workaround in Access for now. I can import the list and, in a query, change Unique Values to "Yes" in the query's properties. That should hold us for a while until we outgrow the 2GB capacity.

    Thanks again!

    Denise Cummings
    Data Systems Administrator
    Friends of the Earth
    Email: dcummings@foe.org
    Direct: 202-222-0718
    www.foe.org



  • 4.  RE: Tool for large email list (1 million+) deduplication

    Posted 11 days ago
    Maybe use Power BI for the de-duplication. I would presume if you can do it in Access the file size should not be an issue and you wouldn't have to take the step of creating the unique "Yes" field. Not that that will save you that much time, but another option.

    Dan

    --
    Dan Snyder
    Assistant Director of Advancement Services 
    Office of Institutional Advancement 

    Bennington College 

    One College Drive 
    Bennington, VT 05201 
    Office: +1 (802) 447-4238 
    Fax: +1 (802) 440-4351






  • 5.  RE: Tool for large email list (1 million+) deduplication

    Posted 11 days ago
    Thanks, Daniel. I just downloaded the free version.

    ------------------------------
    Denise Cummings
    Data Systems Administrator
    Friends of the Earth
    Washington, DC
    ------------------------------



  • 6.  RE: Tool for large email list (1 million+) deduplication

    Posted 10 days ago
    Hello:

    I'd like to revise Dave B's Excel solution by breaking the list down into two or three smaller lists, each small enough that XL can handle it. Let XL sort each sublist alphabetically. Then take the same alphabetical range from each sublist (A-J, for example, roughly the top third), combine those pieces into one list, and de-duplicate that portion. Repeat for the remaining ranges of the alphabet.

    I do not recommend feeding them all through your Outlook application. Outlook will handle them. Your ISP/email provider, probably, will not.

    Former XL support geek...

    ------------------------------
    DJ Brown
    Holder of multiple IT Hats.
    ISO of SAA, Inc.
    Houston, TX
    ------------------------------
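
The split-and-recombine idea in this post can be sketched in Python, the language suggested elsewhere in the thread. This is only an illustration under assumptions, not anyone's actual workflow: the function names are made up, addresses are compared case-insensitively, and the buckets are held in memory, where in practice each would be a separate worksheet or file. It shows why the approach works: copies of the same address always land in the same alphabetical bucket, so each bucket can be de-duplicated on its own.

```python
from collections import defaultdict


def bucket_key(email: str) -> str:
    """Return the bucket for an address: its first character, lowercased."""
    return email[:1].lower()


def dedupe_in_buckets(emails):
    """Group addresses by first character, then de-duplicate each group.

    Duplicates of one address always share a first character, so
    de-duplicating each bucket separately gives the same set of addresses
    as de-duplicating the whole list at once.
    """
    buckets = defaultdict(list)
    for email in emails:
        email = email.strip()
        if email:  # ignore blank entries
            buckets[bucket_key(email)].append(email)

    result = []
    for key in sorted(buckets):
        seen = set()  # only needs to hold one bucket's addresses at a time
        for email in buckets[key]:
            lowered = email.lower()  # assumption: case-insensitive comparison
            if lowered not in seen:
                seen.add(lowered)
                result.append(email)
    return result
```

Bucketing by a single character is the simplest choice. Wider ranges such as A-J work the same way, as long as every copy of an address falls into the same range.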



  • 7.  RE: Tool for large email list (1 million+) deduplication

    Posted 11 days ago
    Hi Denise,
    You might look into using the open source programming language Python for things like this. Python is free. It definitely requires some investment in learning to code, but I have found it well, well worth my time for working with data - all the time I spent learning it has been recouped 100x over. I started with the book "Automate the Boring Stuff with Python" and a couple of introductory books. Once you learn the basics, the possibilities are limitless.
    Ada

    ------------------------------
    Ada Welch
    Director of Planning and Evaluation
    Center for Urban Community Services
    New York, New York
    ------------------------------
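
For a sense of scale, here is a minimal sketch of what a Python de-duplication script for this thread's problem might look like. It assumes a plain text file with one address per line and treats addresses that differ only in letter case as duplicates. Both are assumptions, as are the function and file names. A set of a couple of million short strings should fit in memory on a typical desktop, though that has not been measured here.

```python
from pathlib import Path


def dedupe_emails(src: Path, dst: Path) -> int:
    """Copy src to dst, keeping only the first occurrence of each address.

    Returns the number of unique addresses written.
    """
    seen = set()
    with src.open(encoding="utf-8") as infile, \
            dst.open("w", encoding="utf-8") as outfile:
        for line in infile:
            email = line.strip()
            if not email:
                continue  # skip blank lines
            key = email.lower()  # assumption: compare case-insensitively
            if key not in seen:
                seen.add(key)
                outfile.write(email + "\n")
    return len(seen)
```

It could be called as `dedupe_emails(Path("emails.txt"), Path("emails_deduped.txt"))`, where both file names are placeholders. Everything runs locally, so the list never leaves the machine.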



  • 8.  RE: Tool for large email list (1 million+) deduplication

    Posted 11 days ago
    Thanks, Ada. I'll definitely look into it.

    ------------------------------
    Denise Cummings
    Data Systems Administrator
    Friends of the Earth
    Washington, DC
    ------------------------------



  • 9.  RE: Tool for large email list (1 million+) deduplication

    Posted 11 days ago
    Maybe I missed something, but what are you using to manage your e-mails? I'd be surprised if whatever system you are using for e-mail marketing doesn't have a function to handle duplicate e-mail addresses, and it's not like you can just e-mail 2 million+ people out of your inbox. Using that function would also help you prevent duplicates moving forward as you add new people to your list.

    Or are you in between systems and that's why you're using Access? I think you can still use SQL in Access. Again, you'd have to learn it or someone would have to know it, but you could then run a query to find and remove the duplicates in the data you have rather than having to create an extra field to keep track of that. If this is a one-time situation, it might be cheaper to hire a freelancer who already knows SQL or Python or whatever to do it for you. It should be fairly quick for them if it's based on something simple and straightforward like duplicate e-mail addresses and you're just removing whatever is not the first instance (as opposed to some other criteria for which record to keep).

    If this is not a one-time situation though, I might consider revisiting your data management plans...

    ------------------------------
    Janice Chan
    Co-Organizer, NTEN Nonprofits and Data group
    Twitter: @curiositybone

    Consultant, Shift and Scaffold
    www.shiftandscaffold.com
    ------------------------------
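
The "remove whatever is not the first instance" query described in this post can be sketched as follows. This is a hedged illustration, not Access code: it uses Python's built-in sqlite3 module as a stand-in database, and Access's SQL dialect differs in places. The table name `contacts`, its columns, and the function name are all hypothetical, and addresses are again compared case-insensitively.

```python
import sqlite3


def dedupe_with_sql(emails):
    """Load addresses into a table, delete later duplicates, return the rest."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO contacts (email) VALUES (?)", [(e,) for e in emails]
    )
    # Keep the lowest id for each address and delete every other row.
    conn.execute(
        """
        DELETE FROM contacts
        WHERE id NOT IN (
            SELECT MIN(id) FROM contacts GROUP BY LOWER(email)
        )
        """
    )
    rows = conn.execute("SELECT email FROM contacts ORDER BY id").fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Because the row with the smallest `id` is kept, the first instance of each address survives. Choosing by some other criterion would mean changing the subquery.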